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ABSTRACT 

We combine in a single framework the two complementary benefits of x 2 -template fits and 
empirical training sets used e.g. in neural nets: x 2 is more reliable when its probability density 
functions (PDFs) are inspected for multiple peaks, while empirical training is more accurate 
when calibration and priors of query data and training set match. We present a x 2 -empirical 
method that derives PDFs from empirical models as a subclass of kernel regression methods, 
and apply it to the SDSS DR5 sample of > 75, 000 QSOs, which is full of ambiguities. Ob- 
jects with single-peak PDFs show < 1% outliers, rms redshift errors < 0.05 and vanishing 
redshift bias. At z > 2.5, these figures are 2x better. Outliers result purely from the discrete 
nature and limited size of the model, and rms errors are dominated by the instrinsic variety of 
object colours. PDFs classed as ambiguous provide accurate probabilities for alternative solu- 
tions and thus weights for using both solutions and avoiding needless outliers. E.g., the PDFs 
predict 78.0% of the stronger peaks to be correct, which is true for 77.9% of them. Redshift in- 
completeness is common in faint spectroscopic surveys and turns into a massive undetectable 
outlier risk above other performance limitations, but we can quantify residual outlier risks 
stemming from size and completeness of the model. We propose a matched x 2 -error scale for 
noisy data and show that it produces correct error estimates and redshift distributions accurate 
within Poisson errors. Our method can easily be applied to future large galaxy surveys, which 
will benefit from the reliability in ambiguity detection and residual risk quantification. 
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1 INTRODUCTION 

Photometric redshifts are an attempt to attribute redshift values to 
locations in colour space occupied by objects for which we do not 
have spectroscopic redshifts. Statistically more useful is the aim 
to attribute expected redshift distributions n(z) to these locations, 
which are correct in a frequentist interpretation. However, photo-z 
practitioners are often limited to determine a Bayesian probability 
distribution p(z), which resembles the state of our knowledge, but 
differs from the frequentist n(z) by manifestations of ignorance 
that have to be incorporated to safeguard against known unknowns 
in the data and in the redshift model. 

Photo-z's are obtained using a model expressing the expected 
colour as a function of redshift and a variation of possible intrin- 
sic colour-affecting parameters. These models come in two distinct 
flavours with different advantages: (i) template-based models allow 
the observer to interpret data in empirically unexplored territory by 
extrapolating the templates in magnitude and redshift space; (ii) 
empirical models use a subset of the observed objects with inde- 
pendently known redshift, also known as training sets, which are 
conveniently in the same calibration system as the objects to be 



estimated. If the training set is truly random, it will also provide 
the correct priors to the statistical re dshift estimation . If it is not, 
a weights approach as suggested bv lLima et alj {2008) helps to fix 
the priors. As a result, smaller or pioneering photo-z surveys have 
no choice but to use template-based models, while large surveys 
with small Poisson errors on any of their results wish to control 
their systematics in the best possible ways and prefer the empirical 
model, that minimises systematics in the calibration and priors. 

After obtaining the data and choosing the model, there re- 
mains the choice of estimation code to relate the two. Currently, 
the two perhap s most advoca t ed cat egories are x 2 - me thods (e.g. 
iBenitezI d2000h ; or lWolf etal] dl999h for galaxies and QSO s) and 
artificial neural nets (ANNs, e.g ICollister & Lahavl |2004) . Ear- 
lier work has included global or piece-wise polynomial fittin g 
between colours and redshift lKoolll985l ; [Connolly et al.lll995h . 
Later the empirical fitting approach has develop ed into kernel re- 
gression method s to optimise local fits (e.g. [Wang et al. 1 20071 : 



Boris et al. 2007); these include nearest-neighbor techniqu es (e.g. 
Csabai et al. 120031) and support vector machines (e.g. IWadadekan 
20051) . iBudavaril (2009) articulates a unified framework. Here, we 
note the following general characteristics: 
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(a) x 2 model testing assumes a parametrised model to be free 
of error by definition and then uses error information on the data to 
determine probability density functions (PDF) and hence estimates 
of expectation values and likely errors for the parameters. If the 
model is correct and error-free, the PDF is expected to be correct, 
whether the model originates in templates or empirical data. 

(b) Kernel regression uses model realisations with any origin 
and error properties to estimate a mapping from object features 
onto parameters and requires smoothing over a local region of the 
model. When the smoothing (kernel) function is a Gaussian that 
resembles the data errors, it is identical to x 2 model testing. 

(c) Conventional ANNs with a single-number output acting as 
a parameter estimate deliver unique results. If several parameter 
values are possible given the same input features, they tend to settle 
for the most likely one. Errors can be estimated e.g. by resampling 
the input object as a Gaussian on its error distribution and collecting 
the outputs into a PDF, which may deliver the possible parameter 
range of the main solution but might still not help with ambiguities. 
Probabilistic neural networks (PNNs) which output redshift PDF 
vectors are currently explored. 

The advantage of \ 2 model testing and all PDF-generating 
techniques is that they evaluate a probability distribution across the 
range of considered parameters (e.g. redshift) and hence provide 
a warning signal for ambiguities arising from multiple solutions 
corresponding to local \ 2 minima. This includes nearest-neighbor 
techniques that produce PDFs after resampling t he input objec t on 
its error distribution (as shown for QSOs by Ball et al. 200^). In 
contrast, conventional ANNs deliver the same unique redshift es- 
timate when presented with the same input on different occasions, 
and thus do not record the relative likelihood of alternative solu- 
tions. 

Traditionally, practitioners have co mbined tem plate-based 
models with x' 2 -techniques (starting with iBauml 1 19621) and em- 
pirical training s et models with ANNs or kernel regression (e.g. 
lFirthetalJl2003l) . However, the reliability of x 2 -PDFs has been 
plagued moderately by model deficiencies that could be overcome 
by using empirical models. This has inspired the following explo- 
ration of the ^-technique with empirical models, which is an at- 
tempt to combine their respective advantages and derive PDFs that 
are statistically correct and reliable. 

In this paper, we choose to look at photometric redshifts for 
QSOs in order to confront us with a dataset full of ambiguities (see 
Sect. 2). We generally use Gaussian kernel functions, and in partic- 
ular a pure x 2 -empirical approach, described in Sect. 3, on nearly 
noise-free data. In Sect. 4 we discuss the resulting performance and 
summarise persistent issues. We look particularly at redshift am- 
biguities and show how we can use ambiguous objects in further 
analysis. Sect. 5 aims to give analytic explanations for the origin 
of redshift error floors, biases and outliers, and supports them with 
examples from the data. It also provides a framework to evaluate 
outlier risks in data sets beyond spectroscopic completeness. 

In Sect. 6 we explore the requirements for the x 2 error scale 
in the presence of model errors, which allow us to bring the error 
estimates from the width of the PDF in line with the true redshift 
errors and to predict how they deviate with different choices of er- 
ror scale. We note the potentially conflicting interests of optimising 
a kernel smoothing scale, and propose an approach that combines 
requirements for smoothing and the statistical error scale in one 
choice. Finally, using data and a model with different noise lev- 
els we demonstrate a reconstruction of a redshift distribution that 
shows only deviations in line with Poisson uncertainties. 



2 DATA 

The purpose of this experiment is to combine the advantages of em- 
pirical training samples (calibration and priors implicitly correct) 
with the advantage of the x -method (ambiguity warning based on 
a full PDF). We wish to use a data set with plenty of ambiguities to 
evaluate the benefits of our method. 

For most purposes, a large sample of galaxies would be most 
relevant. However, the only observed galaxy samples large enough 
for empirical training are provided by SDSS at relatively low red- 
shifts 2 < 0.3, while strong ambiguities only appear for galax- 
ies at z > 1. This is why conventional neural networks and 
nearest-neighbor-techniques have produced extremely robust red- 
shift estimates of the SDSS galax y sample with precisions of a z ~ 
0.02 and virtually n o outliers jFirth et alj|2003l : ICsabai et al.ll2003l : 
lOvaizu et alj|2008ah . Optical QSO samples are, however, full of 
redshift ambiguities and thus an ideal testing ground for our pur- 
pose. 

We opted for the SDSS QSO catalogue by [Schneider et all 
d2007h . which is based on SDSS DR5 data but further cleaned and 
amended. It contains ~ 77, 000 QSOs ranging in redshift from 0.08 
to 5.4, and includes SDSS ugriz photometry as well as 2MASS 
data for matching objects. A morphology flag marks extended-vs.- 
unresolved sources (M = 1 or 0), and where we include it in the 
X 2 we assume a fiducial error of om = 0.2 (though this choice 
makes little difference as M does not carry critical information). 

Most objects in the catalogue have vanishing photometric er- 
rors on their ugriz measurements, as they are all from a sample 
which was sufficiently bright for complete spectroscopic follow- 
up. Exceptions are the bluer bands in z > 2.5 QSOs, which con- 
tain redshifted intergalactic Lyman forest absorption that renders 
objects fainter and possibly undetected. We found that errors on ob- 
served object colours are usually smaller than the intrinsic colour 
variations exhibited by QSOs at fixed redshift. As a result, our red- 
shift estimation process is limited by the intrinsic properties of the 
model and not be the data quality. We are thus always in a quality 
saturation domain, which is appropriate for the study of system- 
atic redshift biases and ambiguities as these are not overshadowed 
by large statistical errors from low signal-to-noise measurements. 
Fainter and noisier samples would be additionally affected by wider 
confidence intervals for the observed photometry and would thus 
show wider PDFs with larger true and estimated redshift errors. 

We clean this catalogue by eliminating objects with missing 
magnitudes in one or more bands and objects with untypically large 
photometric errors, eliminating in total 1659 of the 77429 objects 
(~ 2%). The remaining sample of 75770 objects is split half and 
half into a model sample and a data sample, using even-numbered 
and odd-numbered objects, respectively. The distribution of the two 
samples is random and statistically similar in terms of magnitude, 
redshift, and sky position. 



3 METHOD: EMPIRICAL x 2 ESTIMATION 

The x 2 -method rests on the following conditions to work properly: 

1. We compare our data to a model, which needs to represent 
the possible data appropriately. Using a sample from within the 
observed data set for the model already ensures that data and 
model are on a consistent calibration; this is often not the case 
when external models are used, whether they are template-based 
or empirical data from an independent project. 

2. The parameter space of the model needs to be broad enough 
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to cover all local minima of the ^-distribution, which represent 
alternative interpretations of the data. Thus, the model sample 
needs to cover the whole range of parameters expected in the 
data set; ideally it is a random subset, then the statistical priors 
will be correct implicitly. 

3. In practice the data-model comparison is probed on a discrete 
grid, which needs to sample the data error distribution properly. 
Thus, the model needs to be enlarged until its density avoids 
undersampling; this issue is especially critical for data with am- 
biguous interpretation: if the technique is expected to be sen- 
sitive to a lower-probability secondary solution, this is the one 
driving the sampling requirements. 

4. We need to know the errors of our data, so that differences be- 
tween data and model are translated into x measures and hence 
probability density functions, while the model is presumed to be 
error- free. 

The empirical \ 2 -method is virtually identical to the regularly 
employed template-based method. The only difference is that we 
use an empirical set of objects as a discrete model realisation. If we 
trust that the model is a random subsample of the expected data, 
then we can use the empirical objects with all the same weight. 

We call Cij the components in the vector of observables for 
model object i. These components could be all the fluxes in differ- 
ent bands, or they could be colour indices, perhaps combined with 
a single flux value to provide a normalisation. In the case of QSOs, 
we note that their strong luminosity evolution compensates the dim- 
ming with increased distance such that their magnitude distribution 
hardly depends on redshift below z w 2.5; at higher redshift, virtu- 
ally all the redshift constraints are in the colour signature from the 
Lyman forest. That implies that there is little prior information in 
their overall brightness, so that colour indic es contain basica lly all 
the redshift information (and explains why Ball et ah (200§j) have 
not found the magnitude priors to be useful). 

The probability of a single model object i to give rise to the 
observed data Cdataj of a given data object is then 



Pi oc exp 



1/2 Yj - [( C modcl,ij -Cdata.j ) / ° 3 ? 



(1) 



where Oj is essentially a smoothing scale for the weight of 
the association between a data object and a model object. In a 
Bayesian framework, we want Oj to be a correct statistical error on 
(cmodei.ij — Cdata.j) so that pi expresses the probability of model 
object i to give rise to the observation of the data object. If the 
model objects populate the space only sparsely, a smoothing scale 
is even motivated in a Bayesian framework (see below). 

The expectation value and error estimate for the redshift of 
the given object follows trivially from the whole model set, after 
normalising the pi to JJ. pi = 1: 



(zphot) = y^Pi x z i 



(2) 



1 + (-Zphot) 



— —^2pi(zj - (zphot)) 2 ■ (3) 



We use o z as a redshift quality ranking, and as we show in 
Sect. 4 objects with low o z have small true redshift errors as well. 

The full PDF p ^- s (z) is approximated by the combination of 
all discrete instances (i.e. objects) in the model. It could be repre- 
sented e.g. in discrete z-bins after sorting all model objects with 
weight pi into the bins for Zj. Any shortcomings of this PDF result 
from the discrete nature and finite size of the model sample. 



Owing to abundant ambiguities in the data, a good fraction 
of PDFs contain two separate peaks. We thus test the PDF for bi- 
modality and try to deblend it instead of simply adopting the mean: 
To this end we split the redshift range into two intervals separated at 
(zphot) and obtain two local solutions ((z p hot)> <r z )i/2- This could 
already be seen as sufficient if the probability integrals contained 
in the two different peaks were comparable. But if one peak is a 
lot less pronounced, the initial mean estimate may lie within the 
primary mode, so that the z-limit between the two intervals splits 
off and diverts some of the signal from the primary mode into a 
contamination of the secondary solution. 

Hence, we do one more iteration, changing the splitting point 
to a location in the middle of the two alternative estimates, and 
redo the estimates again. If the PDF is in fact bimodal, these two 
solutions represent the two modes just as well as a single mode is 
represented by the original estimate over the full range. But when 
a uni-modal PDF is over-deblended with this approach, we find 
the two resulting estimates to be very close in redshift. We decide 
in favour of a bimodal PDF by requiring the redshift difference 
between the two deblended solutions to be 



1 + Zphot, 2 
1 + Zphot, 1 



1 > 0.4 . 



(4) 



This heuristic limit was chosen after visually inspecting a few 
hundred PDFs and their formal solutions. We note, that this dis- 
tance requirement corresponds to the width of the redshift interval 
over which the PDF is in tegrated for the ODDS parameter in the 
BPZ code teenitedl2OO0h . When we consider the PDF bimodal, 
then we flag the object as ambiguous, record the two solutions 
and determine their relative probability fraction from integrating 
the PDF over the two ranges. This procedure is sensitive even to 
ambiguities with a p-ratio smaller than ~l-in-20. 

The method can be trivially generalised to other object classes, 
and thus objects can simultaneously be classified on the basis of 
relative class probabilities and have their class-internal parame- 
ters such as redshift estimated within a single framework. This ap- 
proach has been dem onstrated with template-base d methods in the 
COMBO-17 survey dWolf et alj|2^0ll.l2004[2003) . 

If the model is not entirely appropriate for the data, the pi 
could be biased away from the correct solution, which may then 
appear improbable, and thus estimation mistakes might be made 
confidently. While this is unlikely to happen with empirical models, 
it can easily result from a choice of inappropriate or incomplete 
templates or priors, that lead to mismatches between the calibration 
of data objects and model objects. Even with perfect match, large 
grid steps can produce discretisation effects when the templates are 
treated numerically as a discrete grid by the photo-z code. 

In a template approach <jj is often enlarged beyond the noise 
in the data object to include an estimate of essentially unknown 
but plausible errors in the model using it| = a^odci %j + °"data,j- 
This has been implemented using eit her constant valu es providing 
an error floor on object colours (e.g. IWolf etal.ll200lh or template 
error f unctions depending on wavelength and template parameters 
(e.g. iBrammer et aiT 2008). As a result, the PDF widens to include 
the correct solution even when the model is biased, and reduces the 
rate of catastrophic estimation outliers. However, it should tend to 
overestimate the statistical redshift errors. 

In the following section, we first implement a x 2 -error scale 
that mimicks a template approach by choosing Oj — o"data,j and 
pretend the model to be free of errors. Although model biases are 
not an issue for our empirical model, the presence of model scatter 
is relevant, but we reserve its rigourous treatment for Sect. 6. 
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Figure 1. Photo-z quality vs. estimated error <r z (non-ambiguous objects 
only); solid lines show cumulative samples with u z < <r z i; mit and dashed 
lines differential samples at o z = cr z i; mit . Top left: The outlier fraction is 
generally below 1%, but it diverges for u z — * as the tolerance 3 X a z 
goes to zero as well. Top right: The error estimates <r z are usually larger 
than the true rms redshift scatter. Bottom left: The bias of the cumulative 
samples remains within ±0.003. Bottom right: Only 59% of the objects are 
classed as unambiguous, and choosing a z < 0.1 selects most of them. 

4 RESULTS USING DATA ERRORS ONLY 

4.1 Overall performance: RMS, bias and outlier rates 

For general discussion we use the data set with ugriz photometry 
and the morphology bit, although we briefly comment later on vari- 
ations that drop the morphology bit or include relatively shallow 
NIR photometry from 2MASS. We investigate the photo-z quality 
for a continuous sequence of sub-samples ordered by the expected 
redshift error a z . Here, we first eliminate all objects flagged as am- 
biguous (~ 41%) and discuss them separately in Sect. 4.3 and 5.3. 
We describe the true photo-z error of each object as 

c- •S'phot •S'spec ,,-\ 

oz = — — . (5) 

1 ~r z S p CC 

We determine the photo-z quality both for differential samples 
of objects with expected errors in a small interval around <r z = 
o~ z.iimit and for cumulative samples of objects with expected errors 
up to a limit, a z < (T z ,u m it. We characterise the photo-z quality of 
any sample with the following numbers: 

1. A fraction of outliers with \Sz\ > 3 x <7 Zj n m it 

2. A typical photo-z error, i.e. the rms Sz of non-outliers 

3. A photo-z bias, i.e. the mean Sz of non-outliers 

4. The fraction of the sample with cr z < a Zt u m it among the full 
data sample, i.e. the completeness. 

The results are presented in Fig. [T] Outlier rates (top left) are 
generally below 1%. The rate goes up as the tolerance goes to zero, 
just because the true errors remain firmly above zero (at u z jj m j t < 
0.02 an object with e.g. \5z\ = 0.06 is already an outlier). Outliers 
are more common at a z > 0.1 as well, but overall sufficiently rare 
as to not affect the cumulative samples much. 

The true redshift errors (top right) are on average correctly 
ranked by the estimated errors a z as shown by the monotony of 



the dashed line representing the differential sample ordered by a z . 
Objects with expected a z < 0.05 have a true Sz rms of < 0.05 
as well, but at a z > 0.05 errors are overestimated. This is not a 
desirable statistical property and results from ignoring model errors 
in the x 2 - em piri ca l approach. We explain this in depth in Sect. 6 
and propose an appropriate procedure after discussing the interplay 
of noise and smoothing scales in producing error estimates. 

The photo-z bias (bottom left) is nearly zero for good-quality 
objects and always < 0.003 for cumulative samples. Globally, the 
method is designed to be bias-free, but non-random sub-samples (as 
the ones plotted here) can always be locally biased. The fraction 
of objects peaks at a z ~ 0.05, but a cumulative sample must be 
relaxed to <r z ,ii m it ~ 0.1 to be > 50% complete (bottom right). 

In summary, a good-quality subsample selected by a z < 0.1 
contains half of all objects, shows a bias of —0.0015, a Sz rms of 
0.04 and 0.9% outliers. In the following, we investigate the depen- 
dence of performance on the desired sample completeness. 

4.2 Selecting subsamplcs by estimated photo-z quality 

Many scientific applications are driven by combined requirements 
for sample size and sample quality. We could thus prefer to choose 
a quality cutoff for samples from diagnostic diagrams of quality 
against completeness. Thus, Fig.[2]shows the quality numbers over 
completeness for cumulative samples. We differentiate between 
overall samples (top row) and high-z only samples (bottom row). 

Our default data set is shown by the solid line. The plots con- 
firm that the outlier rate is small for all samples of medium-to-high 
completeness (top left panel). The rms redshift error remains below 

0. 04 up to ~ 50% completeness (top centre panel), and only starts 
shooting up when including the worst 15% (in terms of <r z ) of the 
unambiguous sample. The redshift bias is < 0.003 for any selec- 
tion of unambiguous objects (top right panel). All solid lines in the 
top row end at 59% completeness as the flagged ambiguous objects 
are excluded here. 

In the high-z sample {z > 2.5, bottom panels), there are fewer 
ambiguities expected due to the strength of the Lyman-forest ab- 
sorption. The completeness reaches 85% as there are only 15% de- 
tected ambiguities. High-z QSOs can still be confused with low-z 
objects that have redder SEDs due to substantial host light contri- 
bution, but the outlier rate (mostly due to this effect) is less than 
1% at > 55% completeness. The cumulative mean redshift error is 
< 0.03, and changes little even to the highest completeness. This 
is simply because at high redshift almost all objects have small a z 
and true errors, and completeness increases very quickly with only 
a little change in u z ji m i t . Also, the bias amplitude is < 10~ 3 across 
most of the completeness range. 

Finally, two more lines represent alternative data sets: the 
long-dashed line is obtained when the morphology information 
is dropped from the \ analysis, and the dotted line is obtained 
when shallow JHK photometry from 2MASS is added. The results 
change little, but generally more information means fewer ambigu- 
ities as well as fewer outliers and smaller rms errors. In Fig. [2] we 
would seem to see on the contrary an increasing outlier rate with 
more information; this is, however, observed at fixed completeness 
and caused by the outlier tolerance shrinking at fixed completeness 
alongside smaller error estimates. 

The main effect of added information is to help with breaking 
ambiguities and thus enlarging the fraction of unambiguous PDFs, 

1. e. the completeness. Indeed the curves shift to higher complete- 
ness with more information, as the fraction of ambiguous objects 
shrinks. The latter declines from 48% in the ugriz-only data via 
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Figure 2. Cumulative photo-z quality in dependence of sample completeness: including more objects of worse and worse quality decreases the overall 
sample quality. Top row: Adding information on top of the ugriz photometry (2MASS data, morphology; weak constraints only) moves the curves to higher 
completeness. Bottom row: QSO photo-z's at z > 2.5 have low bias and rms values of 5z < 0.03. The whole high-redshift sample has only 0.3% outliers 
with Sz > 0.15. The shallow 2MASS NIR photometry and morphological data do not help in this redshift domain. 



41% when adding the morphology bit to only 35% after adding 
shallow JHK data as well. At z > 2.5 most of the redshift infor- 
mation is contained in the Lyman break, and neither morphology 
nor shallow NIR data help. 

Of course, the fraction of ambiguous objects is expected to 
collapse dramatically when adding data that really breaks degen- 
eracies such as deep GALEX U V data that helps with z < 2.5 
objects (see e.g. lBall et al.ll200of) . and deeper NIR data that would 
help with host galaxy light and higher-redshift objects. Adding such 
data, however, is contrary to the scope of this paper, which is to in- 
vestigate how our method deals with ambiguities (see next section). 



cm 



4.3 Redshift ambiguities 

Redshift ambiguities mean that objects from two or more different 
redshift regimes appear in the same region of colour space. For a 
given observed set of colours, two or more redshift solutions are 
possible, and the most we can know beyond the alternative num- 
bers is their relative probability. The ambiguity can only be broken 
by adding some discriminating information such as photometry in 
additional wavebands. Meanwhile, the practical question is how to 
deal with those ambiguities present. Here, we consider redshift bi- 
modalities as detected by our algorithm, and do not discuss higher- 
order complications. 

If redshift estimates are required for individual objects and sta- 
tistical redshift distributions for samples are insufficient, there are 
only two choices: (a) ignore ambiguous objects altogether, or (b) 
trust the more probable alternative in any bimodal PDF. Our sam- 
ple contains 15474 ambiguous objects with a mean probability ratio 
of 78.0%:22.0% for the more probable vs. the less probable solu- 
tion. If these probabilities are statistically meaningful, they ought 
to represent the success rate of trusting the more probable alterna- 



1 ' ' ' ' ' 1 1 1 1 

1 

Phigh 

Figure 3. Redshift ambiguities are measured roughly statistically correct. 
The fraction /high °f ambiguous objects, where the higher-redshift mode is 
correct, agrees with the probability fraction Phigh of this mode in the PDF. 

tive. In accordance with the estimate, we find for a fact that 77.9% 
of ambiguous objects have been attributed to the correct alternative 
(12051 measured vs. 12077 predicted, well within Poisson noise). 
We also note, that when the known spectroscopic redshift is used 
as a prior to choose the right alternative, the error properties are the 
same as those of unambiguous objects, irrespective of whether we 
look at the primary or secondary peak. 

Fig. [3] demonstrates that our relative probabilities of ambigu- 
ous objects are statistically meaningful even at a subtle level: here, 
we sort objects into narrow bins of the relative p-fraction for the so- 
lution at higher redshift, phigh, which is desired to represent the true 
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fraction of objects /high in this bin, for which the higher-z solution 
is the correct choice. The figure shows that indeed /high ~ Phigh 
everywhere. The relative p-fractions thus tell us roughly the risk 
associated with believing either one of the ambiguous redshift al- 
ternatives on an object-by-object basis. 

We do not know how sensitive our ambiguity detection really 
is to p-ratios of less than l-in-20, corresponding to Phigh ^ 0.05 
or Phigh > 0.95. Among our detected ambiguities 8% have ratios 
more extreme than l-in-20, and 1% even more extreme than 1-in- 
50. In any case, ambiguities at very low levels exist and will remain 
mostly undetected, causing catastrophic outliers at the correspond- 
ing small rate. If a sample of objects has an ambiguity level of 
l-in-50, e.g., they are likely to be classed unambiguous objects at 
their more probable redshift, and 2% of them are bound to appear as 
catastrophic redshift outliers. If e.g. 30% of an overall sample live 
in regions of colour space with such a level of ambiguity, this will 
produce a 0.6% fraction of unflagged outliers in the overall sample. 
All existing unflagged outliers in our sample are easily explained 
by low-level ambiguities remaining undetected. 

If trusting risky individual redshifts is unacceptable for the 
purpose at hand, samples of ambiguous objects can at least be used 
in a statistical sense, e.g. in the form of redshift distributions n(z) 
(see following section). 



4.4 Redshift distributions 

Several astrophysical applications do not necessarily require red- 
shifts for individual objects, but can do with redshift distributions 
for subsamples, e.g. in weak gravitational lensing. Such applica- 
tions can easily make use of ambiguous objects as well, for which 
redshift distributions are obtained reliably even though decisions 
between ambiguous alternatives are risky on an individual basis. 
The most general solution for any application would be of course to 
represent any object by its full redshift probability function (PDF) 
and give up on the concept of a single redshift value. However, we 
may still continue to use single values occasionally in the interest 
of data compression. 

In Fig. [4] we look at the summed up distribution of redshifts: 
comparing spectroscopic redshifts with photo-z's we find good 
overall agreement. For the photo-z distribution in the ambiguous 
sample we have counted every object twice, once at each of the al- 
ternative redshifts and using the relative probabilities as weights. 
In the left panel we have represented objects simply by a peak at 
their estimated redshifts (z p hot}- We expect that true structure on 
very small scales in redshifts space will be smoothed in photo-z 
distributions given that photo-z errors mean a lower resolution in 
observing redshift space. 

However, we find some oscillations in the n(zphot), which are 
not or only weakly present in the n(jz spe c); these have been termed 
redshift focussing by some practitioners. One of the reasons for fo- 
cussing is a local bias arising from n(z) priors inherent in the em- 
pirical estimation. Any local maximum in n(z) means that redshifts 
near the peak are a-priori more probable than in the wings. Objects 
in the wings then have their overall PDFs biased towards the peak. 
Peaks are thus overpopulated and troughs in n(z) are depleted. 

In the right panel we have used the full p(z) functions of each 
object and added them up to produce expectation values for the 
redshift distribution n(z). This is much more similar to the correct 
redshift distribution and indicates that the PDFs do contain very 
useful information beyond the redshift expectation value. 
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Figure 4. Redshift distributions n(z) from spectroscopic and photometric 
origins are similar. Left: Objects are represented by (z p hot); ambiguous 
objects are counted twice using the two redshifts with their relative proba- 
bilities as weights. Right: Stacking object PDFs matches n(z) much better. 



4.5 Size of model sample 

Generally, the model sample is required to cover the entire range of 
colours seen in the data sample as otherwise some parts of the latter 
have no appropriate model to compare with and will be interpreted 
wrongly. Hence, the volume of colour space covered by the model 
is fixed, and the size of the model, i.e. the number of its discrete 
members, then determines its local density in colour space. 

We first assume a model distribution without any ambiguities, 
which is thus not prone to catastrophic outliers. The local density 
of model objects in colour space now decides how well the error 
distribution of a data point is sampled. Undersampling can lead to 

• missing the peak of the PDF 

• a local bias in the most likely redshift and thus redshift aliasing 

• larger redshift errors and 

• an incorrect estimate of redshift confidence intervals 

Thus, the density required for optimum performance is such 
that even for well-measured objects with small photometric error 
ellipsoids, and even in sparsely populated areas of colour or redshift 
space, the sampling theorem is fulfilled. 

We now assume a model with ambiguities and focus on an af- 
fected location, where two branches of the model cross in colour 
space. Now, both branches are required to sample the error distri- 
bution of the given data point. Is the density of model points high 
enough to render even the less populated branch clearly effective, 
which is now the limiting factor if the ambiguity is to be detected 
reliably? In this situation, the model density required is a function 
of the desired sensitivity to ambiguities. Protecting yourself against 
the rarest possible high-ratio ambiguities requires clearly the most 
massive model set imaginable. 

Fig. [5] demonstrates the situation quantitatively by reporting 
the photo-z quality for model samples of different size, where the 
largest one is the original sample with 37885 objects and the small- 
est one has only l/64th of that, i.e. 592 objects. Again, the top row 
reports results for the entire sample, while the bottom row only re- 
ports on high-z objects. The two lines show the photo-z quality at 
two fixed levels of completeness as a function of iV mo dci, the num- 
ber of objects in the model sample. 

The most noticable change is a steep increase of outliers when 
the model shrinks, resulting from ambiguities with extreme p- 
ratios becoming invisible in more sparsely sampled, discrete mod- 
els when the last object of a more weakly populated branch dis- 
appears. In contrast, changes in the bias and rms errors among the 
non-outlying bulk of data objects are only moderate; these are de- 
fined by the intrinsicly densest parts of the model, and are the last 
statistics to change when the model shrinks to a near-useless size. 
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Figure 5. Photo-z quality vs. size of the model sample at fixed completeness levels. The quality saturates with ever larger model samples when the photometric 
error ellipsoids become massively oversampled. The properties of the good-quality bulk of objects (rms redshift errors and bias) do not change much with 
model size. Outliers (here for fixed tolerance \Sz\ > 0.2) increase with shrinking model samples whose discrete nature results in loss of sensitivity for redshift 
ambiguities, except for the high-z sample in which ambiguities are intrinsically rare. 



5 UNDERSTANDING PHOTO-Z ISSUES 

We remind the reader that we characterise the performance of pho- 
tometric redshifts by three parameters at given completeness, all of 
which we want to optimise, while they keep posing challenges: 

1. The rms of the redshift error is supported by the intrinsic scat- 
ter of object properties at fixed redshift; enlarging the model 
sample and getting ever more accurate photometry can not be 
expected to help in this situation. 

2. Sub-samples defined in colour, redshift or quality can have lo- 
cally biased photo-z's even though the overall sample has van- 
ishing bias by design. 

3. A small fraction of catastrophic outliers can always arise from 
ambiguities that remain undetected due to large probability ra- 
tios between the two alternatives; the only model to guard 
against all of these is unfortunately just the complete set of spec- 
troscopic redshifts for all objects in the whole survey. 

In this section, we try to understand these factors using ide- 
alised descriptions of a model and illustrate them with examples 
from our data set. We hence investigate factors affecting the local 
properties of photo-z's in small regions of colour space. 



5.1 Local RMS error support from intrinsic diversity 

We assume an idealised situation for an analytic approach: an ab- 
sence of global ambiguities, and a distribution of model and data 
following two simple rules in a local environment of colour space: 

1. The mean colour (c(z)) of objects at redshift z drifts linearly 
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Figure 6. Example for the local rms error support from model diversity: At 
z ~ 3.7 the main redshift information is in the g — r colour; at fixed colour, 
model redshift scatter is the same as the 5z rms error in data sample. 
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Figure 7. Variations in the N(z) of the model (i.e. the priors, centre left panel) cause a local bias of Zp^ot in samples selected by z Bpcc (centre right panel) 
but no bias of z spe c in samples selected by Zphot or colour (right panel). See text for more details. 



with z as (c(z)) = Co + f3(z — Zq)\ this applies to both model 
and data since there is no miscalibration between them. 
2. At every z there is a Gaussian distribution in the intrinsic error- 
free model colours c(z) with an rms cr c , model independent of z. 
The errors in the data colours are denoted by <7 c ,data. 

We can then infer for the model that at fixed colour c there is a 
Gaussian distribution in z with an rms of o" z>mo dci = o" c , model//?, 
because dz/dc = 1//3. An infinitely accurate colour measurement 
c in the data will be attributed a PDF with a mean redshift estimate 
(z(c)) = zq + (c — Cq)//3 and an rms of cr^model, which is equal 
to the rms of the true redshift errors found in a data sample. 

When a colour error in the data is introduced, the redshift will 
become less well constrained. Data points of objects at fixed z will 
then appear with an effective observed colour scatter of a c eff = 
°"c data + °"c model- The mean redshift estimate will be unchanged 
but the rms of the PDF will widen to 

2 

2 / //3\2 °c,data 2 ,,\ 

ff*,eff = (e r c,eff/Pj = — -p 2 remodel- (°) 

Obviously, as long as (J c .data < c C) model the intrinsic variety 
in the model supports a lower bound in the redshift error. The latter 
will only increase substantially due to noise in the data when it is 
larger than the model variety. 

We now illustrate this point with a concrete example from our 
data set, and go through the numbers explicitly. We choose objects 
with a colour of g — r w 1 that are mostly high-z objects scattered 
around z w 3.66. Their main redshift information is in the g — r 
colour index, which brackets the continuum step over the Lyman- 
a line due to the intergalactic Lyman-forest absorption, while the 
other colours provide only weak constraints. We plot the g — r 
colour of objects near this redshift in Fig.|6]and find a nearly linear 
colour-redshift relation. 

From the data sample, we select objects in a narrow colour 
interval of g — r — 1.00 ± 0.05 combined with a redshift in- 
terval of \z — 3.66j < 0.3 to exclude rare outlying objects and 
find 110 unambiguously estimated objects. We find their redshift 
errors to have a 5z rms of 0.024 and compare this now to the in- 
trinsic redshift scatter found in the model. From the model, we se- 
lect 117 objects using the same colour and redshift interval, and 
expect them to determine the main mode in the PDF of the se- 
lected data objects. We find the redshift distribution in the model 
to have a mean of 3.662 and an rms of 0. 1 15, which translates into 
a z = 0.115/(1 + 3.662) = 0.024. The data sample thus shows 
precisely the errors expected from the redshift scatter inherent to 
the model in the relevant location of colour space. 



In this example, the photometric noise is too small to con- 
tribute to the PDF and error sources. All objects in the data have 
g — r colour errors below 0™04. We fit a local slope for the redshift- 
colour relation near z ~ 3.7 and find dz/d(g — r) m 0.73. The 
propagation of photometric errors alone is thus expected to con- 
tribute only a Sz rms of 

dz Og-r_ = Q 73 x Q4 + 3 _ 6g ) w 0Q6 (?) 
d(g — r) 1 + z 

If the data included fainter objects with larger photometric 
errors, then the colour range of model objects contributing to the 
PDF would broaden and redshift errors would eventually be driven 
by the photometric errors: an object with g — r = 1, e.g., needs 
cr s -r > 0.15 (i.e. larger than intrinsic scatter) for this to happen. 



5.2 Local redshift biases 

We continue to use the simple approximation from the previous 
section and introduce a trend of N(z) in the model sample which 
will act as a prior; we assume a locally constant gradient dN/dz in 
an environment of zo with a normalisation of N(z.q) = 1. A brief 
calculation including this extension shows that the resulting PDF is 
skewed by the non-flat prior, the redshift estimate is biased and the 
estimated error shrinks. The expectation value moves by 

, x dN ( < data , \ dN 2 

{ZeB) — 20 = —j^ I ^2 V <Jz, model I = —j^O x,t& i°) 

where cr z>c g is given by Eqn.[6] and the formal rms is now 

/ 2 \ - 2 fdN\ 2 4 

\°"z, eft', biased/ — °z,cff — I —j— J f Z ,efl • (?) 

We illustrate this point with an example from a region where 
N(z) changes strongly with z in the model: Fig.|7]shows a region 
near z m 3 with a nearly linear colour-redshift relation (left panel), 
where N(z) changes by a factor of ~ 2 in the space of Az < 0.2, 
equivalent to dN/dz ~ ±5 (or ~ 20 if expressed onaz/(l+z) 
scale, see centre left panel). At z ~ 2.9 a positive gradient leads to 
an overestimation of z p hot and at z ~ 3.3 a negative one leads to an 
underestimation (see centre right panel). According to Eqn. [8] we 
expect average biases in Sz of ±0.02 at these two redshift points 
(given a z w 0.03), which is roughly what we find in the data: at 
z = 2.9 ± 0.05 the bias is (Sz) = +0.016 and at z = 3.3 ± 0.05 it 
is (Sz) = —0.018 (the use of a z/(l + z)-scale vs. a jz-scale may 
be confusing; all quantities need to be on the same scale). 
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Figure 8. Sparse models are less sensitive to ambiguity. Left: A quarter of all detected ambiguous objects using the full model are classed as unambiguous 
when the model is stripped to only 1 in every 64 objects, and many of those are then outliers. Centre: The colours for 108 outliers (z p hot < L but true 
z = [1.4, 1.9], see box in left panel) are plotted (black) on top of the plume of colours of the full model at z = [1.4, 1.9]. Right: The sparse model does not 
have much overlap with the 40 objects and suggests a near-zero probability for them to be at z > 1.4; their ambiguity is not detected any more. 



Finally, we want to emphasise that such a bias is indeed not 
particularly relevant for most photo-z applications as it is not ob- 
served, when 8z is plotted over z p hot instead of jz spcc (see right 
panel of Fig.O. By design, the application of priors produces the 
correct N(z spC c) for slices in z p i 10 t- or colour space, although the 
reverse is not true as exemplified above. 



5.3 Local undetected ambiguities 

The origin of outliers not flagged as ambiguous is a deficiency of 
the model. We try to give an order-of-magnitude estimate for this 
residual outlier risks. If we use the largest available model sample, 
these estimates quantify a maximum risk, which could only be con- 
strained further with the help from a hypothetical larger sample. If 
we choose a smaller model sample for the photo-z's on purpose, 
then we can still derive better estimates of the expected outlier rate 
from the larger one. 

For a given object in the data with colour c and error cr c , the 
PDF is mostly defined by the part of the model samples that re- 
sides with the central 2cr-contours of its error ellipsoid. In order 
to get any moderately reliable PDF, the model sample needs to 
have at least some objects in this area; a faint ambiguity, i.e. a sec- 
ondary branch with much lower object density in the same region 
of colour space, is detected only if there is at least one object seen 
from that branch. Furthermore, if we observe zero objects of a sec- 
ondary population, but assume a-priori that it is present with a flat 
prior on a density above zero, the expectation value for its density 
is (N) = 1 objects from Poissonian statistics. Thus, simply assum- 
ing a model with iVmodel.iocal objects in the central part of an error 
ellipsoid, the residual risk for an undetected faint ambiguity is 

P2nd = " ■ (10) 

^ * model, local 

This is the maximum oulier risk attached to any single object 
in this location of the data space. The risks-per-object (as output 
by the code) add up whenever defining larger data samples from 
different regions of colour space; they can not be reduced in Pois- 



sonian fashion by increasing the size of the data sample, but instead 
only by increasing the model sample. 

However, the outlier risk can be reduced by stacking the PDFs 
of several objects across a wider interval of colour space into a 
summed redshift distribution n(z), because the number of objects 
in the model contributing to all these PDFs together increases; in 
this case, AT mo del,local counts all objects in the area of colour space 
used for stacking. Of course, as we reduce the Poissonian outlier 
risk and make the PDF more reliable, we also reduce the redshift 
resolution as we integrate over a wider range of colour space. We 
are then choosing trade-offs between resolution and reliability. 

In Fig. [8] we illustrate the effect by comparing photo-z's ob- 
tained using a large model set with those obtained using a model 
set of only l/64th the size, which has been created from the large 
one by sparse but random sub-sampling. Looking at the whole data 
sample, we find that ~ 1/3 of the flagged ambiguous objects are 
not flagged as ambiguous any more when the sparse model is used. 
The left panel in Fig. [8] shows just these 5502 newly unambigu- 
ous objects. The majority of them has still good photo-z estimates 
as indicated by their location near the diagonal, but about 15% of 
them have become outliers as a result of overlooking their ambigu- 
ity when using the sparse model. 

We now focus onto an area containing 108 such 1/64-outliers 
in the range of z — [1.4, 1.9] and z p h t,i/64 < 1 ( see box in 
left panel). The obvious interpretation is that in the colour space 
occupied by those 108 (z>1.4)-objects, the model is dominated 
by objects of z < 1, although there is a secondary branch at 
z — [1.4, 1.9] that is visible only in the full model and diluted 
into discrete absence in the 1/64-model. Supporting this interpre- 
tation, we compare the colours of the 1/64-outliers (black) with 
the plume of model colours across the range of z = [1.4, 1.9]. In 
the centre panel, the full model is seen to cover the 1/64-outliers, 
hence their PDFs contain a mode within z = [1.4, 1.9], and ambi- 
guity is detected. But in the right panel, the sparse model is seen to 
mostly avoid the objects, which eliminates this mode in the PDFs 
and makes them outliers. Given the small colour errors in the data 
a colour mismatch of > 0™1 is already sufficient to suppress the 
probability of consistency in the \ 2 comparison. 
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5.4 Size and incompleteness in model sample; propagation 
into outliers and redshift bias 

From the arguments in the previous section it is clear that the size 
and any incompleteness of the model affects the residual outlier 
risk. We need to distinguish between two kinds of incompleteness, 
one from variations in targetting objects for spectroscopy, and one 
from variations in recovering redshifts from target spectra: 

• Incompleteness in targetting means that relatively fewer, but 
randomly selected objects in a particular part of colour space are 
included into the model sample. Thus, the priors are not implicitly 
correct anymore, but this can be compensated by applying explicit 
weights to the objects. 

• Incompleteness in successfully recovering redshifts from the 
spectra of target objects is likely to act non-randomly in redshift. 
It is usually a result of a spectrograph reaching different depth at 
different redshifts for different galaxy types, due to variations in 
the strength of spectral features and in their visibility within the 
instrumental wavelength range. This will omit an important part 
of the model sample and wipe out its corresponding representation 
in the PDFs. We point out, that the incompleteness of the PDF is 
not reduced by simply enlarging the model set, but only by mak- 
ing it more complete in the redshift recovery rate. Also, if the in- 
completeness affects parts of redshift space strongly by creating 
redshift deserts, it will translate directly into a similar fraction of 
undetectable outliers. 

The oulier risk translates into a redshift bias risk when averag- 
ing the mean redshift of a subsample containing the outliers. At the 
bright end, undetected outliers are more likely due to sparse sam- 
pling of the colour space as a random model sample will contain 
few bright objects in line with their natural scarcity. At the faint 
end, they are more likely the result of instrumentally driven selec- 
tive redshift incom pleteness and can reach dramatically high rates 
(see|Newman 2008, and references therein). Also, the raw sizes of 
data sample and model sample translate into simple Poissonian es- 
timates of the expected errors on the mean redshifts. 

The Poissonian error on the mean redshift of a subsample lo- 
calised in colour space is limited by the error on the mean redshift 
of the stacked PDF, i.e. the number of model objects and the width 
of their redshift distribution, as well as by the data sample, whose 
realisation may deviate according to its size: 



o» = <t z ,pdf x . / — + — . (11) 

y 1 "model, local Jv data, local 

The propagation of the maximum outlier risk into a maximum 
bias risk also rests on assumptions of their redshift distribution. The 
mean redshift error of outlying objects (fcout) in a local region of 
colour space may be anywhere between —1 + 1/(1 + z max ) and 
+Zmax- The maximum bias risk is then: 

non — rccov 
\ "model, local J 

where r7 non _ rccov is the incompleteness of the spectroscopic 
redshift recovery, and the second factor quantifies the sensitivity 
limit to ambiguities. If we approximate example numbers suppos- 
ing j(<5zout)! = 1, then a 20% spectroscopic incompleteness im- 
plies a maximum redshift bias risk of \{5z)\ = 0.2, far above any 
of the 5z rms values seen in our work. This is a maximum risk, and 
better constraints require a better model sample. 

This implies that spectroscopic incompleteness deserves by 
far the greatest concern in empirical redshift estimation work. This 



is not much of an issue for any of the bright SDSS samples, but very 
important for any empirical photo-z work in magnitude regimes 
where spectrographs do not provide near-100% completeness. 

These are rough estimates for overall samples, but again our 
code produces risk estimates per object, which allow elimination of 
uncertain objects from samples entering follow-up analyses. Our 
method is to put all spectroscopic target objects from the model 
sample without a reliably recovered redshift into a separate model 
and evaluate the fraction of the total PDF they account for in each 
individual object; this share of probability is attributed directly to a 
residual outlier risk for the object in question. 

5.5 Are there ideal sizes for model samples? 

We quantify the ideal size for a complete (tyion-rccov = 0) model 
sample, while for incomplete models the conclusion depend on 
the selections being made for the follow-up analysis. We assume 
that a data sample will be partitioned into subsamples with stacked 
PDFs, or each object will be considered on an individual basis 
(-Wdata.iocai = 1)> in either case, the critical factor is the size of 
the model in the local region of the subsample or the individual 
error ellipsoid, and our rms tolerance on the mean redshift. The 
following applies if mean redshift of a sample is the critical figure 
(as e.g. in gravitational lensing) while some applications (such as 
BAO measurements) may not depend so much on outliers. 

The potential rms error is given by Poissonian precision and 
residual outlier risks, and these change with different powers of the 
size of the local model sample (Eqn. [TT] and 1121 ). The two error 
sources have similar impact if 

iVmodel, local = l/o"z,PDF , (13) 

and their combined error estimate is 

0"( z ) = \/2/Ar mo dcl,local ■ (14) 

If the model has fewer objects than this, the mean redshift of 
data subsamples is limited by outlier risks that decline oc 1 /N , and 
if it has more it will be limited by redshift scatter that continues 
to decline oc 1/ yN. If the balance is not ideal in a given model 
sample and enlarging is not an option, there is the alternative of 
changing redshift resolution and making subsamples cover differ- 
ent ranges of redshift space. As long as the local colour-redshift 
relation is vaguely linear, this would change model size with the 
width of the redshift range, iV mo dci, local oc o" z .pdf; this changes 
outlier risks in the opposite direction to Poissonian precision and 
can be used to rebalance the error sources to an optimal mix. 

If we take model incompleteness into consideration, it dom- 
inates the error sources as soon as rjnon-recov > l/iV mo< jel,iocal- 
Median redshifts, in contrast, are only weakly affected by outliers. 

6 USING MODEL ERRORS IN THE x ERROR SCALE 
OR SMOOTHING FUNCTION 

In the previous sections we used a model sample that is almost 
noise-free in conjunction with a \ approach, which is only per- 
fectly reliable when the model is exactly free of noise. We observed 
that the error estimates showed deviations from the true rms error 
in Fig.Q]that we still want to explain. Also, future applications will 
rely heavily on redshifts of faint objects with noisy photometry. In 
this section we clarify the the role of noise for the choice of the \ 2 - 
error scale and its consequences for n(z) estimates, and take into 
account the requirement for smoothing as well. 



© 0000 RAS, MNRAS 000, 000-000 



Bayesian photo-z's with empirical training 1 1 



Assuming that we use a model with full completeness, the 
data sample and the model sample are drawn from the same par- 
ent distribution <f>(z, c) of objects in colour space. Noise, however, 
smoothes these distributions into a new density function p(z\c), 
and may differ between the data and model sample. If both sam- 
ples are smoothed to the same degree, their p(z\c) are identical. 
Then the photo-z PDF of any individual data object at location a is 
simply given by p(z\ci) as determined from the model. Otherwise, 
we apply an operation to make the smoothing scales consistent. We 
could also choose to smooth both data and model to a common 
larger scale. 

If we were prepared to give up the concept of a PDF for an 
individual object, we could define regions in colour space and at- 
tribute the integrated model properties in the region to the corre- 
sponding subsample of data objects distributed over that region. In 
the following, we differentiate two cases, that of a constant target 
smoothing scale, and that of one which varies across colour space. 



6.2 Spatially varying target smoothing scales 

Here, we discuss only the option of spatially varying smoothing 
scales that are identical for the data and model sample from the 
start. Thus, the error scales are already matched, i.e. Oj — in the 
X 2 -expression, hence typically no model object would be found at 
the location c, of a data object. Again, the only option is to give up 
on p(z)-expressions for individual objects, and to define instead a 
volume in colour space, over which data objects are combined into 
a subsample that has the n(z) of the model within the same volume 
attributed. However, not having p(z) for individual objects means 
a reduced resolution for the mapping from colour to n(z). 

Anyone desiring to use spatially varying scales which differ 
between data and model sample, will face the problem of finding 
the target scale for an object in dependence of its original location 
before the smoothing due to the present errors. This can only be 
done as a first-order approximation using a representation of the 
error scales that is already smoothed by the errors itself. 



6.1 Spatially homogeneous target smoothing scales 

A spatially homogeneous target smoothing scale is straightforward 
to deal with, as any object from the data or model sample needs 
to be smoothed further by an amount that is trivially determined. 
Every pair of data-model points can be compared separately: 

1 . If the errors of the model object are smaller than the data error, 
fmodci < ""data, we need to smooth the model object further by 
u ] = °data — °~modci- This is most easily achieved by replacing 
the model object with a Gaussian of width Oj and evaluating it at 
the location of the data object a . We can thus simply use the x - 
framework described before and use aj as the error or smooth- 
ing scale. Note, that this smoothing scale is obtained by sub- 
tracting the model error from the data error. In contrast, adding 
these errors into the smoothing scale smoothes the model too 
much compared to the data. In the case of an error-free model 
we recover the usual x 2 error scale aj = <7data- 

2. If the errors are nearly identical, then the smoothing scales are 
already matched. We find the smoothing scale aj — > and the 
number of model objects N — > contributing to the solution 
with diverging Poisson errors. Discretisation effects always call 
for a sufficient smoothing scale driven by the density of points 
in the model. We can choose a larger target smoothing scale for 
both, or we define p(z) only for subsamples distributed over 
a region in colour space. Thus, holding on to non-zero model 
smoothing requires to smooth data points as well. This not only 
wipes out previously present information, but needs to be car- 
ried out rigourously (as in 3.) to obtain an unbiased PDF. 

3. If the model error is larger than data error, <7 mo dei > Cdata, 
we need to smooth the data object, which we implement by 
resampling the data object as a Gaussian. Model smoothing is 
still desired for numerical reasons, and a common target scale 
for data and model needs to be chosen that is larger than either 
one. Data smoothing is done by resampling as a Gaussian G(ci) 
with width cr 2 esamp = cr 2 argct - cr 2 ata at many a,j, and model 
smoothing as before by evaluating a model Gaussian at the a,j 
as in (1.), whereby a] = cr t 2 argc t - ff^odei- 

Having p(z) for individual data objects conserves resolution 
in the data sample that is lost when a common p(z) is attributed to 
subsamples. But it requires that either data errors are larger than 
model errors from the start, or noise be introduced into the data 
after observation, due to a discrete model asking for smoothing! 



6.3 Error propagation through the x 2 -expression and the 
ideal smoothing scale 

We continue to use the idealised example of a locally linear c(z)- 
relation from Sect. 5.1, only that the model is now considered to 
have a measurement error as well. Thus at fixed redshift, the model 
has a Gaussian scatter a\ model = <r 2 model _ int + cr 2 model _ orr that 
results from intrinsic scatter convolved with measurement errors. 

A data sample at true redshift z has a colour scatter CT c ,data = 
°"c,modoi-int + °"c,data-crr that results from the same intrinsic scat- 
ter but convolved with the data measurement errors. It translates 
into a corresponding rms scatter in the redshift expectation 
values given the local slope of the c(z)-relation: 

<*{z) = ( fT c,model-int + a c, data-error ) X (^T^ ' 

This result does not depend on the choice of the x 2 error scale. 
Only the width of the PDF, and hence the redshift error estimate a z , 
depends on the x 2 error scale, as it is a convolution of the model 
scatter and the model smoothing scale Oj : 

&z = (o"c,modcl-int + fc.modcl-crror + X ■ (^) 

Requiring that the PDF and error estimates are representative 
of the true rms redshift errors, we ask that a( z ) = a z , which is 
fulfilled as in Sect. 6.1 when using 

&j ^c, data — err ^c, model — error O^) 

for a matched error scale. This implies again that the model 
objects have smaller errors than the data objects, so that the de- 
mands of a non-zero smoothing scale to overcome discretisation 
effects and of the correct error scale can be met simultaneously. 
The photometry of the query data set only needs to be as good as 
the lower limit provided by a desired smoothing scale derived from 
the density of the model points. Otherwise, we have to uphold the 
desired smoothing scale by introducing additional colour and red- 
shift scatter into the data by resampling the data objects to a larger 
matched error scale. This means that the photometry of our original 
query data set was too good to be useful. Conversely, the level of 
data errors drives the necessary density of model points to suppress 
outlier risks as discussed in Sect. 5. 
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Figure 9. Comparison of redshift distributions n(z spe c) and PDF-stacked 
n(2pl 1 ot); data and model errors are matched (left) or added (right). The 
bottom panels show the difference of the n(z) over the expected Poisson 
eiTor for every redshift bin: Matched errors produce nearly Poissonian n(z) 
without bias, while added errors produce biases and larger differences. 



We point out the consequence of adding model and data errors 



in the \ 2 error scale: using a? — o\ data- 



J j " c, data — err 1 c, model - 

expect to overestimate the true errors by a factor of 



'1 + 



2(T 2 

c, mo del — error 



c, model — int 



(18) 



A brief numerical experiment is conducted to verify these con- 
siderations: we scatter all our data objects to reach a new error of 
0^1414 in every colour index, and all our model objects to reach 
now 0™1. Following Eqn. [l5]we expect the matched error scale to 
produce an rms Sz error similar to (cr z ), the mean width of the PDF, 
while the added error scale should enlarge (a z ) by a degree that de- 
pends on the intrinsic model scatter and in our case may reach up 
to a factor of a/2 where the intrinsic scatter vanishes. 

We investigate first the high-z regime, where colour and red- 
shift follow a simple relation. Looking at all objects with z > 3 
but excluding statistical outliers with \Sz\ > 0.1 or a z > 0.1, 
we measure the mean width of the PDF and the rms 8z error. In 
Sect. 5.1 we measured an intrinsic scatter in the model colours of 
c c ,model-lnt — 0.15 in this redshift regime, which predicts an in- 
crease in a z by a factor of ~ 1.2 using Eqn.1181 

The matched error scale produces an rms of 0.0291 and 
(<j z ) = 0.0291 in perfect agreement. With the added error scale 
the rms remains almost unchanged at 0.0279 but the error estimate 
is increased by a factor of ~ 1.25 to (er z ) = 0.0366. We take this as 
empirical evidence that the added error scale overestimates errors 
in line with the analytic expectations from the idealised example. 

As a result the stacked PDF from the entire data sample could 
be wrong near structures in n(z) or edges. In Fig. [9] we compare 
them to the spectroscopic redshift distribution for both error scales. 
The histogram plots (top row) make it difficult to spot the small 
differences, but in the bottom row we show the difference between 
the redshift histograms scaled by the expected Poisson noise in each 
redshift bin, which is a% = -/V| ata + iV modcl for the difference. The 
matched error scale shows no apparent bias and an rms scatter of 
1.08 that is very close to Poissonian (= 1). All bins with 10 or less 
objects (at the tails of the redshift range) have been eliminated for 
this plot. In contrast, the added error scale shows drifting biases and 
an rms scatter roughly enlarged by y2. 




Figure 10. Photo-z quality (as in Fig. 1) using noisy data, noisy model and 
the matched error approach: The error estimates a z are now comparable to 
the true rms redshift scatter. Larger noise leads to fewer unambiguous PDFs 
(completeness). However, bias and outliers remain broadly as expected. 



The conclusion is that the matched error scale approach is 
an appropriate way to obtain estimates of the redshift distribution 
which are virtually correct within Poisson noise. 

We also repeat in Fig.[l0]overall photo-z performance figures 
for the noisy experiment using the matched error scale in the \ 2 - 
We find now that the rms Sz errors follow roughly the error esti- 
mates g z in the differential sample line, in contrast to the version in 
Sect. 4 that ignored the model errors. Bias and outlier rates remain 
as low as before, but the completeness, i.e. the fraction of unam- 
biguous PDFs, has dropped due to the larger errors and smoothing. 
We revisit the low-noise case in the following section. 

6.4 Revisiting the low-error case 

Armed with the understanding of the impact of smoothing scales 
onto estimated errors, we reconsider the results of Sect. 4.1, where 
we decided to use the canonical x 2_a Pproach while ignoring the 
model errors. Since the low-noise data were unaltered and ran- 
domly split into data and model sample, the two were on a common 
matched error scale to start with. The application of smoothing to 
the model only has caused an overestimation of the redshift errors 
(see Fig. 1), which can now be explained. Given equal errors and 
our choice of scale (Jdata = fmodei = o j , we expect to have over- 
estimated errors by up to v2; an added error scale could even have 
led to a factor of up to t/3, all depending on the relative degree of 
intrinsic scatter. The alternative of no smoothing suggested by the 
error scales was of c ourse ruled ou t by the d iscretisation effects. 

We note, that lOvaizu et alj d2008bl) estimated errors by 
smoothing their model with a top-hat kernel function, of course 
with the motivation to collect enough model objects for a good 
definition of the PDF. However, they do not find an increase in 
redshift errors as we do here, and we speculate that this may re- 
sult from photometric errors being much smaller than the intrinsic 
colour scatter and the density of their galaxy model sample being 
extremely high, so that moderate smoothing would introduce only 
little non-locality and have a very small effect. 

lust to prove how the incompatibility of smoothing scales 
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Figure 11. Photo-z quality (compare with differential lines in Fig. 1) for 
low-noise data using different smoothing scales: Right: The error estimates 
<r z are nearly identical to the true rms redshift scatter when using near- 
zero smoothing as suggested by the matched error scale. Larger smoothing 
overestimates tr z . Left: Smaller smoothing scales lead to larger noise in the 
PDF and thus higher outlier risks (~ 10% for a 3 = 0™01). 

propagates into inappropriate error estimates for the low-noise 
SDSS QSO data used here, we rerun the experiment from Sect. 4.1 
with two fixed scales of Oj = 0™01, close to the desired zero scale 
and Oj — 0™10, larger than the typical Odata levels of 0™04. The 
results are shown in Fig. II Hand compared to the original version 
using (Tdata- It is very clear that the near-zero smoothing produces 
an excellent correspondence between the Sz rms and the error es- 
timate o z (right panel). Larger smoothing scales shift the curve to 
the right towards progressively overestimated errors. 

However, the left panel demonstrates the expected downside 
of near-zero smoothing, which is a large (~10%) fraction of out- 
liers. These appear together with an increased fraction of objects 
classed to have an unambiguous PDF. A shrinking smoothing func- 
tion draws its PDF from fewer model objects, overlooks more true 
ambiguities (see Eqn. ITOb and produces more residual outliers. In 
contrast, the large smoothing scale of K" 1 pushed outlier rates be- 
low 1%, even to 0.1% in parts of the plot. 

6.5 A practical requirement: a constant data error scale 

The results above create a desire for two perhaps conflicting de- 
mands: (i) we want to derive PDFs for individual objects using the 
matched error scale, which requires a spatially homogeneous tar- 
get smoothing scale. ( ii) We want to keep smoothing scales on the 
order of the data errors in order to use the signal contained in the 
data rather than destroying it with further smoothing. However, if 
some parts of the data are much noisier than others, they will drive 
the requirements for the target smoothing scale. Hence, we ideally 
want to have a constant error across our data sample. 

If we are concerned only with bright objects, these may have 
small errors that may even be dominated by calibration noise and 
thus be approximately constant on a magnitude scale. On the con- 
trary, when we are concerned with faint objects and diverging mag- 
nitude errors, a flux scale is more useful. The errors of faint objects 
are essentially background noise and constant on a flux scale. Only 
objects that are brighter than the background have their flux errors 
growing due to their own Poisson noise or calibration noise. If these 
are to be treated at the same time as faint objects, a transformed flux 
scale could be introduced, which maps the mean error as a function 
of flux onto a constant function. The above procedures could then 
be exercised using transformed fluxes as object features. 

A problem remains even for transformed fluxes, when a large 
data set includes strong variations in observed depth, as the ideal 
flux transformation function changes with depth. This challenge is 



posed by strong variations in interstellar foreground absorption as 
well. Since object features need to be de-reddened, the intrinsic 
depth of the data set changes with the absorption level. In these 
cases, a data set and its model sample may need to be broken down 
into more homogeneous parts to allow for optimum treatment. 



7 CONCLUSIONS 

We have presented a method to obtain Bayesian photometric red- 
shifts using the x 2 -technique with empirical models. This approach 
is intended to combine in one framework the two complementary 
benefits of \ 2 -template fitting and of empirical training sets as used 
e.g. by neural networks. The advantage of x 2 - me thods is that a 
probability density function is created, which can be inspected for 
ambiguities arising from multiple peaks. The advantage of empiri- 
cal samples is that they can be made to match perfectly the distri- 
bution and calibration appropriate for the data sample, as opposed 
to templates that rely on negotiable assumptions. PDFs generated 
with imperfect templates can still be unreliable, and where tem- 
plate errors are taken into account in the x 2 , me y widen the PDF 
and increase the error estimates. 

Our method produces reliable statistically correct PDFs if a 
complete empirical model is available. For incomplete models we 
are able to quantify the mis-estimation risks associated with each 
individual object. A very simplified description of the comparison 
could be: Conventional NNs are accurate but unreliable with ambi- 
guities; x 2 -template fitting is less accurate, but guards itself against 
unreliability with PDFs and template errors; the new x 2 -empirical 
method is both accurate and reliable. 

We used a data set full of ambiguities to demonstrate that the 
method delivers its promises, i.e. the SDSS DR5 QSO sample with 
~ 75, 000 objects, split half and half into a data and a model sam- 
ple. Objects with unambiguous PDFs show less than 1% outliers, 
typical redshift errors < 0.05 and vanishing redshift bias. At higher 
redshift (z > 2.5) these figures are a factor of ~ 2 better. The 
outliers purely result from the limited size of the model sample, 
while the rms errors are dominated by the instrinsic variety of QSO 
colours given the information content in the survey data. 

Objects with PDFs classed as ambiguous correctly evaluate 
the relative probability of the two possible solutions. This provides 
either accurate weighting factors when using both interpretations 
for an object in a later analysis, or an accurate outlier risk when us- 
ing only the more probable solution. In the latter case, our method 
predicted that in 78.0% of ambiguous objects the more probable 
peak in the PDF would be the correct one, which was then found to 
be true for 77.9% of them, different by less than Poisson noise. 

The method had been inspired by the template-bas ed photo-z 
code emplo yed in the CADIS and COMBO-17 surveys dWolf et al.l 
Il999ll200ll) . except that it replaces template realisations on a grid 
with empirical model objects. It is thus also capable of classifying 
objects with Bayesian probabilities into stars, galaxies, QSOs and 
into various subclasses. 

For noisy data we propose a matched error approach, which 
is designed to compare data and model at common resolution 
in colour space. This translates into a x 2 ' enor scale given by 
°data — °model> anc l we show that this method provides accurate 
error estimates. In contrast, adding data and model errors in the 
X 2 -expression broadens the probability distribution and thus over- 
estimates the rms redshift errors. Finally, we show that the matched 
error scale in the x 2 -empirical method reconstructs the redshift dis- 
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tribution of a noisy data sample practically within Poissonian n(z)- 
errors, if a complete albeit noisy empirical model is available. 

The method is most easily implemented when object features 
can be transformed onto a scale where data errors are constant and 
model errors are smaller than data errors. Then the model smooth- 
ing is provided by matching the error scales in the \ 2 expression 
and no data resampling is required. In this case, the procedure is 
computationally very fast; e.g. the QSO sample in this work was 
processed in 20 minutes on a year 2004 PowerPC Mac laptop. 

Empirical models are more representative of the data, and thus 
the derived PDFs are substantially more accurate than PDFs de- 
rived from template fitting, allowing to trust redshifts, ambiguities 
and outlier risk evaluations, which is critical for understanding sys- 
tematics in large photo-z data sets. However, their limitations arise 
principally from the size and completeness of the model sample. 
Redshift-selective incompleteness as it often appears at the faint 
end of spectroscopic surveys translates into a massive undetectable 
outlier risk that can far exceed any of the other performance lim- 
itations. While such incompleteness is the main challenge for any 
empirical method, we provide a framework to evaluate catastrophic 
risks for individual objects as to allow for their separate handling. 

An important application of future photo-z work is in mas- 
sive cosmological surveys for galaxy photo-z's, which will indeed 
require superb control of systematics such as redshift biases and 
outliers. The results presented here for QSOs are not applicable to 
galaxies in a quantitative sense, but our use of QSOs was moti- 
vated by the rich ambiguities present, which for galaxies are only 
expected in future large samples. When they become available, 
they will benefit just as well from our method that derives robust 
Bayesian photo-z's from empirical samples and evaluates residual 
risks for outlier rates and photo-z biases. 
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